This analysis aims to provide insight into car crashes in New York from 2016 to 2022, focusing on key factors such as vehicle type, hour, and weather.
We work with two datasets: one containing the vehicle crashes and one describing the weather at the location.
| Name | Rows | Columns | Each row is a | Link |
|---|---|---|---|---|
| Vehicles dataset | 2.11M | 29 | Motor Vehicle Collision | Vehicles dataset |
| Weather dataset | 59,760 | 10 | Time stamp of Weather | Weather dataset |
In this step we collect the data from the datasets, clean it, and merge it into a comprehensive dataset for analysis. Before merging, the data needs to be cleaned, enriched with additional information, and reduced in size to make it more manageable.
The data cleaning process involves removing columns with high NA ratios, filtering out rows with missing values, and creating new columns to categorize the main causes of accidents. The data is then enriched with additional information such as the day of the week, month, quarter, year, and time of day.
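The enrichment step described above could look roughly like the following sketch. It uses base R only; the data frame `crashes`, the column `crash_datetime`, and the time-of-day bins are illustrative assumptions, not the names or cut-offs used in the actual analysis.

```r
# Toy stand-in for the crash data; the column name is hypothetical.
crashes <- data.frame(
  crash_datetime = as.POSIXct(
    c("2021-01-03 08:15:00", "2021-07-19 23:40:00"),
    tz = "UTC"
  )
)

# Derive calendar fields from the timestamp.
crashes$year    <- as.integer(format(crashes$crash_datetime, "%Y"))
crashes$month   <- as.integer(format(crashes$crash_datetime, "%m"))
crashes$quarter <- quarters(crashes$crash_datetime)
crashes$weekday <- weekdays(crashes$crash_datetime)  # names depend on the locale
crashes$hour    <- as.integer(format(crashes$crash_datetime, "%H"))

# Bin the hour into four assumed time-of-day bands: [0,6), [6,12), [12,18), [18,24).
crashes$time_of_day <- cut(
  crashes$hour,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("Night", "Morning", "Afternoon", "Evening"),
  right  = FALSE
)
```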
Both datasets, especially vehicles.csv, contain a large number of
rows, which require a lot of memory and time to process. For this reason,
we remove rows with missing values and columns with high NA ratios,
such as vehicle type 3, 4 and 5, which hold very few values because
they are only filled in for accidents involving multiple vehicles.
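As a minimal sketch of this cleaning step, the snippet below drops columns whose share of missing values exceeds a threshold and then removes incomplete rows. The toy data frame, its column names, and the 50% threshold are assumptions made for illustration; the source does not state the threshold actually used.

```r
# Toy stand-in for vehicles.csv; column names are hypothetical.
vehicles <- data.frame(
  crash_date     = c("2021-01-03", "2021-01-03", "2021-02-10", NA),
  borough        = c("BROOKLYN", "QUEENS", NA, "MANHATTAN"),
  vehicle_type_1 = c("Sedan", "Taxi", "Sedan", "Bus"),
  vehicle_type_3 = c(NA, NA, NA, "Sedan"),
  stringsAsFactors = FALSE
)

# Share of missing values per column.
na_ratio <- colMeans(is.na(vehicles))

# Keep only columns at or below the assumed NA threshold.
max_na_ratio   <- 0.5
vehicles_clean <- vehicles[, na_ratio <= max_na_ratio, drop = FALSE]

# Drop rows that still contain any missing value.
vehicles_clean <- vehicles_clean[complete.cases(vehicles_clean), , drop = FALSE]
```

Dropping sparse columns first matters: applying `complete.cases()` before removing a column such as `vehicle_type_3` would discard almost every row.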
Weather.xlsx is a smaller dataset and its cleaning process is simpler:
we only need to convert the time column to a proper date-time
format and rename some columns for clarity. Finally, the
vehicle data is merged with the weather data to create a comprehensive
dataset for analysis.
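The weather preparation and the merge might be sketched as follows. This assumes the weather rows are hourly and that each crash is matched to the weather row for the same hour; the column names (`time`, `precipitation (mm)`, `rain_mm`, `crash_datetime`, `hour_key`) and the timestamp format are hypothetical.

```r
# Toy stand-in for the weather data; names and format are assumptions.
weather <- data.frame(
  time = c("2021-01-03T08:00", "2021-01-03T09:00"),
  `precipitation (mm)` = c(0.0, 2.5),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert the text timestamp to a date-time and give the column a simpler name.
weather$time <- as.POSIXct(weather$time, format = "%Y-%m-%dT%H:%M", tz = "UTC")
names(weather)[names(weather) == "precipitation (mm)"] <- "rain_mm"

# Toy stand-in for the cleaned crash data.
crashes <- data.frame(
  crash_datetime = as.POSIXct(
    c("2021-01-03 08:15:00", "2021-01-03 09:50:00"),
    tz = "UTC"
  ),
  borough = c("BROOKLYN", "QUEENS"),
  stringsAsFactors = FALSE
)

# Build a shared hourly key so each crash matches the weather of its hour.
crashes$hour_key <- format(crashes$crash_datetime, "%Y-%m-%d %H:00")
weather$hour_key <- format(weather$time, "%Y-%m-%d %H:00")

# Inner join: crashes without a matching weather row are dropped.
merged <- merge(crashes, weather[, c("hour_key", "rain_mm")], by = "hour_key")
```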
The result is:
| Name | Rows | Columns | Each row is a |
|---|---|---|---|
| Merged dataset | 1M | 40 | Combination of vehicle and weather data |
After merging the data, we save each year to a separate file to make it easier to analyze the data by year. We will run the analyses both on the full dataset and on individual years.
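One way to write a file per year is to split the merged data on its year column and loop over the pieces. The sketch below is illustrative: the `year` column, the file naming scheme, and the output directory are assumptions, and a temporary directory is used so the example does not touch real files.

```r
# Toy stand-in for the merged dataset; the year column is assumed to exist.
merged <- data.frame(
  year    = c(2016, 2016, 2017),
  borough = c("BROOKLYN", "QUEENS", "BRONX"),
  stringsAsFactors = FALSE
)

# Hypothetical output location.
out_dir <- file.path(tempdir(), "by_year")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per year, each written to its own CSV file.
by_year <- split(merged, merged$year)
for (yr in names(by_year)) {
  out_file <- file.path(out_dir, paste0("merged_", yr, ".csv"))
  write.csv(by_year[[yr]], out_file, row.names = FALSE)
}
```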
In this graph we can see the total number of accidents per year in New York from 2016 to 2022. The number of accidents seems to be decreasing over the years, which is a positive trend.
It is important to note that we cannot draw conclusions from this graph alone, as other factors may influence the number of accidents. In particular, the COVID-19 pandemic and the lockdowns of 2020 and 2021 could have reduced the number of vehicles on the road and therefore the number of accidents.
## [1] "Correlation coefficient: 0.59"
This graph shows the correlation between the total number of
accidents and the total rainfall per month in New York in 2021. The
result of 0.59 indicates a moderate positive correlation
between the two variables, suggesting that months with more rainfall
tend to have more accidents. A correlation alone does not show that
rain causes the additional accidents.
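A coefficient like this can be obtained with `cor()` on monthly totals. The numbers below are made up purely to show the mechanics and will not reproduce the 0.59 reported above; the data frame and column names are hypothetical.

```r
# Invented monthly totals, for illustration only.
monthly <- data.frame(
  month     = 1:6,
  accidents = c(8200, 7900, 9100, 9400, 10100, 9800),
  rain_mm   = c(60, 45, 90, 70, 110, 95)
)

# Pearson correlation between monthly accidents and monthly rainfall.
r <- cor(monthly$accidents, monthly$rain_mm)
print(paste("Correlation coefficient:", round(r, 2)))
```

With only twelve monthly points per year, a single coefficient is sensitive to one or two unusual months, so it is worth reading alongside the plot.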
Monthly Number of Accidents by Rainfall Category with Monthly Rainfall
This graph shows the monthly number of accidents in New York in 2021, categorized by rainfall intensity. The black line represents the total number of accidents per month, while the blue bars represent the total rainfall per month on a secondary y-axis. The dots represent the number of accidents in each rainfall category.
The graph suggests that the number of accidents tends to increase with higher rainfall, especially in the “1 mm - 4 mm (Light rain)” and “>4 mm - 7 mm (Moderate rain)” categories.
At the same time, the majority of accidents occur when there is no rain, which could simply reflect the fact that most of the time it is not raining in New York. Hence, without the total number of vehicles on the road, it is difficult to draw conclusions from this data alone.
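To illustrate why raw counts per category can mislead, the sketch below assigns each hour to a rainfall category and compares total accidents with accidents per hour spent in that category. The data are invented, the "No rain" and ">7 mm (Heavy rain)" labels and cut-offs are assumptions (only the light and moderate categories are named above), and dividing by hours is only a partial normalization since traffic volume is still unknown.

```r
# Map rainfall in mm to a category; cut-offs outside the two named
# categories are assumptions.
rain_category <- function(mm) {
  labels <- c(
    "No rain",
    "1 mm - 4 mm (Light rain)",
    ">4 mm - 7 mm (Moderate rain)",
    ">7 mm (Heavy rain)"
  )
  idx <- 1 + (mm >= 1) + (mm > 4) + (mm > 7)
  factor(labels[idx], levels = labels)
}

# Invented hourly observations.
hourly <- data.frame(
  rain_mm   = c(0, 0, 0, 0, 2, 3, 5, 9),
  accidents = c(20, 25, 22, 21, 30, 28, 35, 33)
)
hourly$rain_cat <- rain_category(hourly$rain_mm)

# Total accidents and number of observed hours per category.
accidents_by_cat <- sapply(split(hourly$accidents, hourly$rain_cat), sum)
hours_by_cat     <- sapply(split(hourly$accidents, hourly$rain_cat), length)

# Accidents per hour of exposure to each rainfall category.
rate_by_cat <- accidents_by_cat / hours_by_cat
```

In this toy data the "No rain" category has the most accidents in total but the lowest rate per hour, which is the pattern the caveat above warns about.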
Here the goal is to analyze the distribution of accidents by hour, day of the week, and month.
Hopefully, this analysis will provide insights into the temporal patterns of accidents in New York, helping to identify high-risk and safer periods.
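The temporal distributions can be computed as simple frequency tables. The following sketch uses a handful of invented timestamps and base R; the vector name `crash_times` is hypothetical.

```r
# Invented crash timestamps, for illustration only.
crash_times <- as.POSIXct(
  c("2021-03-01 08:10:00", "2021-03-01 17:05:00",
    "2021-03-02 17:45:00", "2021-03-05 23:30:00"),
  tz = "UTC"
)

# Count accidents per hour, keeping hours with zero accidents in the table.
hour    <- as.integer(format(crash_times, "%H"))
by_hour <- table(factor(hour, levels = 0:23))

# Count accidents per ISO weekday (1 = Monday) and per month.
by_weekday <- table(format(crash_times, "%u"))
by_month   <- table(format(crash_times, "%m"))

# Hour of day with the most accidents.
peak_hour <- as.integer(names(by_hour)[which.max(by_hour)])
```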
Esquisse is a package that allows you to create interactive plots and dashboards in R. It is similar to Tableau in that it provides a user-friendly interface for creating visualizations without writing code. After loading your data, you can create plots interactively by dragging and dropping variables.
To use Esquisse, you need to install the package and then load it in
your R script. After loading the package, you can launch the web app by
calling the esquisser() function with your data as an
argument. That will open a web browser with the Esquisse interface,
where you can create plots interactively by dragging and dropping
variables. Esquisse works with the plotly package to create
interactive plots.
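A minimal launch sequence might look like this. The data frame `df` is a toy stand-in, and the call is guarded so that it only runs in an interactive session where the package is installed.

```r
# install.packages("esquisse")  # one-time installation

# Toy stand-in for the data to explore.
df <- data.frame(
  hour      = c(8, 9, 17),
  accidents = c(120, 95, 210)
)

# Open the drag-and-drop interface with the data preloaded.
if (interactive() && requireNamespace("esquisse", quietly = TRUE)) {
  esquisse::esquisser(data = df)
}
```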
NOTE: Loading the data was tricky. Esquisse would not load the merged dataset when it was passed as an argument from the code. Opening Esquisse without an argument started the app and prompted us to load the data from the interface instead.
When we then tried to load the data from the interface, a warning message in the UI finally revealed the problem: loading more than 5MB of data was not possible.
This limit comes from Shiny, which is used to build the Esquisse interface. The maximum request size for Shiny apps is 5MB by default, which is not enough for our dataset. It could have been a deal breaker for using Esquisse in this project, or in any other project with a large dataset.
To solve this issue, we can increase the maximum request size for
Shiny apps by setting the shiny.maxRequestSize option through
options() to a higher value. In this case, we set it to 300MB, which should
be enough for our dataset.
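The option is expressed in bytes, so 300MB is written as a product. A short sketch, to be run before launching Esquisse:

```r
# Raise the Shiny upload limit from the 5MB default to 300MB (value in bytes).
options(shiny.maxRequestSize = 300 * 1024^2)

# Check the value that is now in effect.
getOption("shiny.maxRequestSize")
```

The setting only lasts for the current R session, so it needs to be set again after a restart.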
At the bottom left of the pane, in the options tab, you can choose to
render the plots with plotly, which makes them interactive. Once
enabled, you can hover over the plots to see the data points and values,
or click on the legend to filter the data.
In this section, we will create an interactive map of accidents in
New York using the plotly package. The map will show the
density of accidents based on latitude and longitude coordinates, with
color indicating the density of accidents in each area. It will be an
interactive map that allows you to zoom in and out and hover over data
points to see more information.
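A density map of this kind can be sketched with plotly's `densitymapbox` trace type. The coordinates below are invented points around New York, and the column names, radius, map style, center, and zoom are assumptions rather than the settings used for the map in this report.

```r
# Invented crash coordinates; column names are hypothetical.
crashes <- data.frame(
  latitude  = c(40.7580, 40.7128, 40.6782, 40.7306, 40.7831),
  longitude = c(-73.9855, -74.0060, -73.9442, -73.9352, -73.9712)
)

fig <- NULL
if (requireNamespace("plotly", quietly = TRUE)) {
  # Density layer: color reflects how many points fall near each location.
  fig <- plotly::plot_ly(
    data   = crashes,
    type   = "densitymapbox",
    lat    = ~latitude,
    lon    = ~longitude,
    radius = 10
  )
  # Base map centered on New York; this style needs no access token.
  fig <- plotly::layout(
    fig,
    mapbox = list(
      style  = "open-street-map",
      center = list(lat = 40.73, lon = -73.94),
      zoom   = 9
    )
  )
}

# Print `fig` in an interactive session or R Markdown chunk to render the map.
```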
At first glance, it can look as if the whole of New York is yellow, but if you zoom in, you will see the density of accidents in each area. This is to be expected and is due to the high number of inhabitants and accidents in New York.
It is interesting to see that the accidents are concentrated in certain areas, such as Manhattan and Brooklyn, which are more densely populated and have more traffic. We can also see a trend that horizontal roads (East-West) have more accidents than vertical roads (North-South).
As expected, the accidents are also more concentrated on main roads and highways, where the traffic is heavier, until a bridge is reached. In the map, we can see that virtually all bridges are free of accidents, which is a good sign for traffic safety.